This assignment aims to solve the problems of Mini Challenge 2.
The global settings of the R code chunks in this post are set as follows.
The following code prepares for R package installation.
# !diagnostics off
packages = c('raster', 'sf', 'tmap', 'clock', 'DT', 'ggiraph', 'plotly',
             'tidyverse', 'dplyr', 'readr', 'hrbrthemes', 'mapview')
for (p in packages) {
  if (!require(p, character.only = T)) {
    install.packages(p)
  }
  library(p, character.only = T)
}
The following code imports the raw data sets from Mini Challenge 2 ("car_assignments.csv", "cc_data.csv", "gps.csv", "loyalty_data.csv").
credit_debit <- read_csv("data/cc_data.csv")
loyalty_data <- read_csv("data/loyalty_data.csv")
car_assignment <- read_csv("data/car_assignments.csv")
GPS <- read_csv("data/gps.csv")
Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies?
Comparison of total amount between credit/debit card and loyalty card
To find out which place is most popular and when it is popular, we need a data table that lists the most popular places and their times.
# !diagnostics off
loyalty_data$count_event <- 1
credit_debit$count_event <- 1
credit_debit$timestamp <- as.POSIXct(credit_debit$timestamp,
                                     format = "%m/%d/%Y %H:%M")

# Aggregate each source by day of January and location
loyalty_money <- loyalty_data %>%
  mutate(Day = as.numeric(format(as.Date(timestamp, "%m/%d/%Y"), "%d"))) %>%
  group_by(Day, location) %>%
  dplyr::summarize(freq_loyal = sum(count_event),
                   money_loyal = sum(price))
cc_money <- credit_debit %>%
  mutate(Day = as.numeric(format(timestamp, "%d"))) %>%
  group_by(Day, location) %>%
  dplyr::summarize(freq_cc = sum(count_event),
                   money_cc = sum(price))

# Join the two aggregates and compute the differences
Comparison <- full_join(cc_money, loyalty_money, by = c('Day', 'location'))
Comparison[is.na(Comparison)] <- 0
Comparison$Money_dif <- Comparison$money_cc - Comparison$money_loyal
Comparison$Freq_dif <- Comparison$freq_cc - Comparison$freq_loyal
Comparison <- Comparison %>%
  arrange(freq_cc)
datatable(Comparison, rownames = FALSE)
During data exploration, we can see five records with no spending on the credit/debit card side but with consumption records on the loyalty card side.
Table 2.2: Combination of loyalty_money and cc_money. From the joined data frame Comparison, we can now see that Katerina’s Cafe is the most popular place in the records from Day 6 to Day 19, appearing 6 times in the 14 days of records.
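Those five records can be pulled out of the joined table with a simple filter. Below is a minimal sketch on toy stand-in data, assuming the joined table carries a credit/debit total and a loyalty total per location and day; the column names and values here are illustrative, not the real data.

```r
library(dplyr)

# Toy stand-in for the joined comparison table (illustrative values)
Comparison <- tibble::tibble(
  location      = c("Stewart and Sons Fabrication", "Katerina's Cafe"),
  Day           = c(13, 13),
  Money_cd      = c(0, 23.15),       # credit/debit card total
  Money_loyalty = c(4071.95, 23.15)  # loyalty card total
)

# Records with no card spending but a positive loyalty amount
anomalies <- Comparison %>%
  filter(Money_cd == 0, Money_loyalty > 0)
```

On the real joined table, the same filter surfaces all five anomalous records at once.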
To find more anomalies in the data, we need a clearer data visualization.
Create new columns with the text for the tooltips:
Comparison$Money_dif <- round(Comparison$Money_dif, 2)
Comparison <- Comparison %>%
  mutate(text = paste0("Location: ", location, "\n",
                       "Day of January: ", Day, "\n",
                       "Money Difference: ", Money_dif))
Comparison <- Comparison %>%
  mutate(text2 = paste0("Location: ", location, "\n",
                        "Day of January: ", Day, "\n",
                        "Frequency Difference: ", Freq_dif))
Heat map of money difference
p <- ggplot(data = Comparison,
            aes(x = Day, y = location, fill = Money_dif, text = text)) +
  geom_tile() +
  geom_text(aes(label = Money_dif)) +
  scale_fill_gradient(low = "pink", high = "blue") +
  theme_ipsum()
p <- p + theme(axis.text.y = element_text(size = 8))
ggplotly(p, tooltip = "text")
Figure 1 Money difference
Figure: Heat map of frequency difference
z <- ggplot(data = Comparison,
            aes(x = Day, y = location, fill = Freq_dif, text = text2)) +
  geom_tile() +
  scale_fill_gradient(low = "lightyellow", high = "red") +
  geom_text(aes(label = Freq_dif)) +
  theme_ipsum()
z <- z + theme(axis.text.y = element_text(size = 8))
ggplotly(z, tooltip = "text2")
Figure 2 Frequency difference
Comparing these two heat maps, we can see more anomalies:
1. In these two weeks, except Maximum Iron and Steel, whose differences in money and frequency are both 0, every other place shows a difference on some days, either in frequency or in money.
2. At Frydos Auto Supply on Day 13, there is a large money difference of 9912.43, yet the frequency-difference heat map shows 0. Tracing back to the raw data through the DT table, we find that the loyalty card shows a total consumption of 542.79 while the credit/debit cards show 10455.22, even though the number of consumption records is the same. This is quite strange.
3. Another anomaly, from data table 4.1, is the five records in which the credit/debit card cost is 0 but the loyalty card has consumption records. Among these five records, the most doubtful is Stewart and Sons Fabrication on Day 13, with a cost of 4071.95, which also needs to be noted.
Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find? Please limit your answer to 8 images and 500 words.
To proceed with Q2, we perform data manipulation on the other two data sets, GPS and car_assignment.
Data manipulation for the car_assignment data set (building full names)
car_assignment <- car_assignment %>%
  unite("Full Name", LastName:FirstName, remove = TRUE)
head(car_assignment)
# A tibble: 6 x 4
`Full Name` CarID CurrentEmploymentType CurrentEmploymentTitle
<chr> <dbl> <chr> <chr>
1 Calixto_Nils 1 Information Technology IT Helpdesk
2 Azada_Lars 2 Engineering Engineer
3 Balas_Felix 3 Engineering Engineer
4 Barranco_Ingrid 4 Executive SVP/CFO
5 Baza_Isak 5 Information Technology IT Technician
6 Bergen_Linnea 6 Information Technology IT Group Manager
Data manipulation to plot the route map
glimpse(GPS)
Rows: 685,169
Columns: 4
$ Timestamp <chr> "01/06/2014 06:28:01", "01/06/2014 06:28:01", "01/~
$ id <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~
bgmap <- raster("Data/MC2-tourist.tif")
bgmap
class : RasterLayer
band : 1 (of 3 bands)
dimensions : 1595, 2706, 4316070 (nrow, ncol, ncell)
resolution : 3.16216e-05, 3.16216e-05 (x, y)
extent : 24.82419, 24.90976, 36.04499, 36.09543 (xmin, xmax, ymin, ymax)
crs : +proj=longlat +datum=WGS84 +no_defs
source : MC2-tourist.tif
names : MC2.tourist
values : 0, 255 (min, max)
tm_shape(bgmap) +
  tm_rgb(r = 1, g = 2, b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255)

Abila_st <- st_read(dsn = "Data/Geospatial",
layer = "Abila")
Reading layer `Abila' from data source
`C:\linanyaogaibian\Dataviz_blog\_posts\2021-07-13-assignment\Data\Geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 3290 features and 9 fields
Geometry type: LINESTRING
Dimension: XY
Bounding box: xmin: 24.82401 ymin: 36.04502 xmax: 24.90997 ymax: 36.09492
Geodetic CRS: WGS 84
# Parse the timestamp and derive the day of January (these steps are
# needed for the glimpse output below)
GPS$Timestamp <- as.POSIXct(GPS$Timestamp, format = "%m/%d/%Y %H:%M:%S")
GPS$id <- as_factor(GPS$id)
GPS$day <- as_factor(as.numeric(format(GPS$Timestamp, "%d")))
glimpse(GPS)
Rows: 685,169
Columns: 5
$ Timestamp <dttm> 2014-01-06 06:28:01, 2014-01-06 06:28:01, 2014-01~
$ id <fct> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~
$ day <fct> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,~
GPS_sf <- st_as_sf(GPS,
coords = c("long", "lat"),
crs= 4326)
GPS_sf
Simple feature collection with 685169 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 24.82509 ymin: 36.04802 xmax: 24.90849 ymax: 36.08996
Geodetic CRS: WGS 84
# A tibble: 685,169 x 4
Timestamp id day geometry
* <dttm> <fct> <fct> <POINT [°]>
1 2014-01-06 06:28:01 35 6 (24.87469 36.07623)
2 2014-01-06 06:28:01 35 6 (24.8746 36.07622)
3 2014-01-06 06:28:03 35 6 (24.87444 36.07621)
4 2014-01-06 06:28:05 35 6 (24.87425 36.07622)
5 2014-01-06 06:28:06 35 6 (24.87417 36.07621)
6 2014-01-06 06:28:07 35 6 (24.87406 36.07619)
7 2014-01-06 06:28:09 35 6 (24.87391 36.07619)
8 2014-01-06 06:28:10 35 6 (24.87381 36.07618)
9 2014-01-06 06:28:11 35 6 (24.87374 36.07617)
10 2014-01-06 06:28:12 35 6 (24.87362 36.07618)
# ... with 685,159 more rows
gps_path <- GPS_sf %>%
  group_by(id, day) %>%
  summarize(m = mean(Timestamp),
            do_union = FALSE) %>%
  st_cast("LINESTRING")
gps_path
Simple feature collection with 508 features and 3 fields
Geometry type: LINESTRING
Dimension: XY
Bounding box: xmin: 24.82509 ymin: 36.04802 xmax: 24.90849 ymax: 36.08996
Geodetic CRS: WGS 84
# A tibble: 508 x 4
# Groups: id [40]
id day m geometry
<fct> <fct> <dttm> <LINESTRING [°]>
1 1 6 2014-01-06 15:02:08 (24.88258 36.06646, 24.88259 36.06~
2 1 7 2014-01-07 12:41:07 (24.87957 36.04803, 24.87957 36.04~
3 1 8 2014-01-08 14:35:25 (24.88265 36.06643, 24.88266 36.06~
4 1 9 2014-01-09 12:04:45 (24.88261 36.06646, 24.88257 36.06~
5 1 10 2014-01-10 16:04:58 (24.88265 36.0665, 24.88261 36.066~
6 1 11 2014-01-11 16:18:32 (24.88258 36.06651, 24.88246 36.06~
7 1 12 2014-01-12 13:31:05 (24.88259 36.06643, 24.8824 36.066~
8 1 13 2014-01-13 13:46:15 (24.88265 36.06642, 24.8826 36.066~
9 1 14 2014-01-14 14:04:23 (24.88261 36.06644, 24.88262 36.06~
10 1 15 2014-01-15 15:33:54 (24.88263 36.06647, 24.88257 36.06~
# ... with 498 more rows
gps_path_selected <- gps_path %>%
  filter(day == 13)
tmap_mode("view")
tm_shape(bgmap) +
  tm_rgb(r = 1, g = 2, b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
  tm_shape(gps_path_selected) +
  tm_lines()
Q3 Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data? Please limit your answer to 8 images and 500 words.
Answer: To answer Q3, we need to build a relationship between the credit/loyalty card owners and the car owners. This connection is based on matching car parking sites to the business locations on the map. We need to find a reasonable time gap that serves as evidence that a car was parked, so that transaction locations can be matched to parking sites.
So, first of all, we restructure the GPS data set to compute the gap in minutes between consecutive points, and build a box plot to find an appropriate time gap for a parked car.
GPS_track <- GPS_sf %>%
  dplyr::arrange(day, Timestamp) %>%
  group_by(id, day) %>%
  mutate(diff = Timestamp - lag(Timestamp),
         diff_mins = as.numeric(diff, units = 'mins'))
Now, to decide which time gap indicates a parking period, we build a box plot based on the diff_mins column.
boxplot1 <- ggplot(GPS_track, aes(x = "", y = diff_mins)) +
  geom_boxplot() +
  labs(title = "Distribution of time gap") +
  theme_classic()
ggplotly(boxplot1)
Since the box plot shows that more than three quarters of the diff_mins values are 0, it is not enough by itself to justify a parking threshold, but we also find one car whose movement is unusual, with a time gap of 1058.27 minutes. So we assume that data points with diff_mins > 5 are parking points. Based on this assumption, we build a new column called point.
# Flag parking points: a gap of more than 5 minutes counts as parking
GPS_track$point <- ifelse(GPS_track$diff_mins > 5, 1, 0)
GPS_track$point[is.na(GPS_track$point)] <- 0
# Extract the hour for later matching with transaction times
GPS_track$hour <- format(GPS_track$Timestamp, format = "%H")
glimpse(GPS_track)
Rows: 685,169
Columns: 8
Groups: id, day [508]
$ Timestamp <dttm> 2014-01-06 06:28:01, 2014-01-06 06:28:01, 2014-01~
$ id <fct> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ day <fct> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,~
$ geometry <POINT [°]> POINT (24.87469 36.07623), POINT (24.8746 36~
$ diff <drtn> NA secs, 0 secs, 2 secs, 2 secs, 1 secs, 1 secs, ~
$ diff_mins <dbl> 0.00, 0.00, 0.03, 0.03, 0.02, 0.02, 0.03, 0.02, 0.~
$ hour <chr> "06", "06", "06", "06", "06", "06", "06", "06", "0~
$ point <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
Aggregate the parking points so that we know, for each car and each day, how many places the driver parked at and at what hour.
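The chunk that builds Tracker_GPS is not echoed above. A minimal sketch consistent with the glimpse() output below (columns id, day, hour, x), shown here on toy stand-in data rather than the real 685,169-row table, could look like:

```r
library(dplyr)

# Toy stand-in for GPS_track (illustrative rows only)
GPS_track <- tibble::tibble(
  id    = factor(c(1, 1, 1, 2, 2)),
  day   = factor(c(6, 6, 6, 6, 6)),
  hour  = c("07", "07", "08", "07", "07"),
  point = c(0, 1, 1, 0, 0)
)

# Count parking points per car (id), day and hour
Tracker_GPS <- GPS_track %>%
  group_by(id, day, hour) %>%
  summarize(x = sum(point), .groups = "drop")
```

Applied to the real GPS_track, this yields one row per car, day and hour, with x counting the parking events in that hour.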
glimpse(Tracker_GPS)
Rows: 2,966
Columns: 4
$ id <fct> 3, 19, 26, 29, 1, 28, 1, 16, 1, 15, 16, 1, 15, 24, 16, ~
$ day <fct> 11, 11, 11, 12, 16, 19, 7, 7, 7, 7, 7, 9, 9, 9, 11, 11,~
$ hour <chr> "00", "00", "00", "00", "00", "00", "01", "01", "03", "~
$ x <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0~
# car_assign2 (the chunk creating it was not echoed; judging from the
# glimpse below, it is a copy of the car_assignment table)
car_assign2 <- car_assignment
glimpse(car_assign2)
Rows: 44
Columns: 4
$ `Full Name` <chr> "Calixto_Nils", "Azada_Lars", "Balas_~
$ CarID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12~
$ CurrentEmploymentType <chr> "Information Technology", "Engineerin~
$ CurrentEmploymentTitle <chr> "IT Helpdesk", "Engineer", "Engineer"~
car_assign2$CarID <- as_factor(car_assign2$CarID)
Car_owner <- full_join(Tracker_GPS,car_assign2, by = c("id" ="CarID"))
glimpse(loyalty_data)
Rows: 1,392
Columns: 5
$ timestamp <chr> "1/8/2014", "1/8/2014", "1/14/2014", "1/9/2014",~
$ location <chr> "Carlyle Chemical Inc.", "Carlyle Chemical Inc."~
$ price <dbl> 4983.52, 4901.88, 4898.39, 4792.50, 4788.22, 474~
$ loyaltynum <chr> "L8477", "L5756", "L2769", "L3317", "L8477", "L5~
$ count_event <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
credit_debit$hour<- format(credit_debit$timestamp, format="%H")
glimpse(credit_debit)
Rows: 1,490
Columns: 6
$ timestamp <dttm> 2014-01-06 07:28:00, 2014-01-06 07:34:00, 2014-~
$ location <chr> "Brew've Been Served", "Hallowed Grounds", "Brew~
$ price <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.~
$ last4ccnum <dbl> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, ~
$ count_event <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
$ hour <chr> "07", "07", "07", "07", "07", "07", "07", "07", ~
credit_debit$last4ccnum <- as.factor(credit_debit$last4ccnum)
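The chunk that produces cd_people is likewise not echoed. Judging from the glimpse() output below, it counts credit/debit transactions per card, day and hour; a sketch on toy stand-in data (column names assumed from that output) could be:

```r
library(dplyr)

# Toy stand-in for credit_debit (illustrative rows only)
credit_debit <- tibble::tibble(
  last4ccnum  = factor(c(4795, 4795, 7108)),
  day         = factor(c(6, 6, 6)),
  hour        = c("07", "08", "07"),
  count_event = c(1, 1, 1)
)

# One row per card / day / hour, with x = number of transactions
cd_people <- credit_debit %>%
  group_by(last4ccnum, day, hour) %>%
  summarize(x = sum(count_event), .groups = "drop")
```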
glimpse(cd_people)
Rows: 1,481
Columns: 4
$ last4cnum <fct> 8156, 5407, 3484, 8332, 9551, 9551, 2142, 2681, 34~
$ day <fct> 12, 13, 19, 19, 19, 13, 6, 6, 6, 6, 6, 6, 6, 6, 6,~
$ hour <chr> "03", "03", "03", "03", "03", "06", "07", "07", "0~
$ x <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
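The loyalty_people table is built the same way, except that the loyalty data carries no time of day, so we can only count purchases per card and day. A sketch on toy stand-in data, with column names assumed from the glimpse() output below:

```r
library(dplyr)

# Toy stand-in for the loyalty data (illustrative rows only)
loyalty_data <- tibble::tibble(
  loyalnum    = factor(c("L8477", "L8477", "L5756")),
  day         = factor(c(6, 6, 6)),
  count_event = c(1, 1, 1)
)

# One row per loyalty card / day, with x = number of purchases
loyalty_people <- loyalty_data %>%
  group_by(loyalnum, day) %>%
  summarize(x = sum(count_event), .groups = "drop")
```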
glimpse(loyalty_people)
Rows: 606
Columns: 3
$ loyalnum <fct> L1107, L1485, L1682, L2070, L2169, L2247, L2343, L2~
$ day <fct> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ~
$ x <dbl> 2, 2, 3, 2, 3, 3, 3, 4, 1, 3, 2, 3, 3, 3, 3, 1, 3, ~
car_cl<-inner_join(cd_people,Car_owner,by=c("day","x","hour"))
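The idea behind this join is that a card and a car match when they share the same (day, hour, activity count) signature; only rows whose signatures coincide survive the inner join. A toy illustration (the names and values here are made up):

```r
library(dplyr)

# Toy card-activity and car-activity tables
cd_people <- tibble::tibble(
  last4ccnum = factor(c(4795, 7108)),
  day        = factor(c(6, 6)),
  hour       = c("07", "12"),
  x          = c(1, 2)
)
Car_owner <- tibble::tibble(
  id          = factor(c(1, 2)),
  day         = factor(c(6, 6)),
  hour        = c("07", "13"),
  x           = c(1, 2),
  `Full Name` = c("Calixto_Nils", "Azada_Lars")
)

# Only the (day = 6, hour = "07", x = 1) signature matches
car_cl <- inner_join(cd_people, Car_owner, by = c("day", "x", "hour"))
```

Note the uncertainty in this method: two different people with identical activity counts in the same hour would produce a false match, which is one reason the inferred ownerships need manual checking.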
Q5. Do you see evidence of suspicious activity? Identify 1 to 10 locations where you believe the suspicious activity is occurring, and why. Please limit your response to 10 images and 500 words.
First suspicious activity: on Day 13 there is a large money difference of 9912.43 at Frydos Auto Supply, although the number of cost records is the same (4).

Second suspicious activity